Updating content index for content searches on networks

ABSTRACT

According to an aspect of the present invention, update requests indicating the changes in content are created external to an application causing the changes, and a content index is updated based on the update requests. In an embodiment, the changes are detected based on examining packet contents on the way to a data repository (e.g., database server). As a result, the overhead in data repositories as well as any crawlers updating the content index may be reduced.

RELATED APPLICATIONS

The present application is related to and claims priority from the co-pending India Patent Application entitled, “UPDATING CONTENT INDEX FOR CONTENT SEARCHES ON NETWORKS”, Serial Number: 1533/CHE/2006, Filed: Aug. 25, 2006, naming the same inventors as in the subject patent application.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to management of electronic records used by systems such as search engines, and more specifically to a method and apparatus for updating content index for content searches on a network.

2. Related Art

There is a recognised need in the market place to search the content provided on various networks and provide the results of searches. For example, a user may access a search engine by a suitable interface (e.g., provided by a web browser) and provide a search criteria (or a query, in general). The search engine performs a search for content matching the search criteria, and sends the results to the user.

The content accessible on the networks is often indexed and stored to enable a faster/efficient search. In general, the indexes are examined to determine the matching content and/or various attributes such as metadata representing the security features (read/write permissions), status (last modified/accessed date), access control (who can access the data), etc.

One general requirement is that the indexes be updated to reflect the changes in the content (to be searched). If the indexes are not accurately updated, the search results may correspondingly be erroneous.

In one prior embodiment, a crawler parses databases storing the contents at regular intervals, and updates the indexes to reflect the changes. However, a shorter crawling interval leads to use of undesirable processor time/power. Alternately, a longer interval fails to update the changes for a correspondingly long time, thereby providing stale, old or invalid results to a search request (query).

Accordingly, what is needed is an efficient updating of content index for accurate search results without loading the processor time and power.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be described with reference to the accompanying drawings briefly described below.

FIG. 1 is a block diagram of an example environment in which various aspects of the present invention can be implemented.

FIG. 2 is a flowchart illustrating the manner in which content index is updated according to an aspect of the present invention.

FIG. 3A is a block diagram illustrating the details of an application server in one embodiment.

FIG. 3B is a block diagram illustrating the details of an application server in an alternative embodiment.

FIG. 4 is a diagram illustrating the manner in which a packet format can be used to determine the changes to desired content in one embodiment.

FIG. 5 is a block diagram illustrating the manner in which the packets on a network can be monitored to determine the changes to desired contents in an embodiment.

FIG. 6 is a block diagram illustrating an example embodiment in which various aspects of the present invention are operative when software instructions are executed.

In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

1. Overview

According to an aspect of the present invention, updates as to changes in content are detected external to applications causing the changes, and a content index is updated with the changes. Due to such detection external to the applications, the overhead on applications may be reduced, in addition to reduction in development of the application logic (software and/or hardware).

In an embodiment, the detection is based on examining packet contents on the way to a data repository (e.g., database server) which stores the content of interest. As a result, the overhead in data repositories as well as any crawlers updating the content index may be reduced.

In one implementation, the packets are examined in a web/application server which is in the path from a client system to the data repository. In another implementation, a monitoring device, which passively monitors packets destined to the data repository, is provided. The changes may be propagated by sending the appropriate packets to a crawler or by storing the change information in a secondary storage, with the crawler propagating the changes later to the content index/database.

According to another aspect of the present invention, database triggers are provided to propagate the change information to the content index when changes are made in the contents of the database.

Several aspects of the invention are described below with reference to examples for illustration. It should be understood that numerous specific details, relationships, and methods are set forth to provide a full understanding of the invention. One skilled in the relevant art, however, will readily recognize that the invention can be practiced without one or more of the specific details, or with other methods, etc. In other instances, well-known structures or operations are not shown in detail to avoid obscuring the features of the invention.

2. Example Environment

FIG. 1 is a block diagram illustrating an example environment in which various aspects of the present invention can be implemented. The environment is shown containing client systems 110A-110C, network 120, web/application server 130, search engine 140, index database 150, crawler 160, and database server 170. Each system is described in further detail below.

Client systems 110A-110C enable users to access various applications executing on application server 130, and also to send search requests to search engine 140. In general, packets are sent and received on network 120 when a client system accesses the applications or sends search requests. The commands contained in the packets may cause application server 130 to change the content stored in database server 170. Similarly, client systems 110A-110C may send search requests to search engine 140 and receive the results in corresponding packets also.

Database server 170 (example of a data repository) stores content, which can be searched via search engine 140, as described in sections below. The content can represent any information (or stored entities) such as web pages, email messages, file structure/content, tables, etc. Application server 130 executes a number of applications and causes changes (in addition to read operations) in the content (data or metadata) stored in database server 170, typically in response to commands received in packets from client systems 110A-110C.

In an embodiment, database server 170 is implemented using Oracle 9i software available from Oracle International Inc., the intended assignee of the subject patent application, and application server 130 is implemented using Oracle Application Server 10g, available from Oracle Corporation, the intended assignee of the subject patent application. Though a single database server and a single application server are shown in FIG. 1, it should be appreciated that typical environments would contain many of such systems, and substantially more content stored in such database servers may be managed according to various aspects of the present invention.

Index database 150 stores content indexes (indices), which facilitate search (and possibly easy access) of content stored in various sources, including database server 170. The index information may indicate the key words for which the corresponding document needs to be considered as a match and various attribute information corresponding to the document, noted above. The index information is generally organized consistent with the search strategy of search engine 140.
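For illustration only, index information of the kind described above (key words mapped to matching documents, plus per-document attributes) might be sketched as follows in Python. This is a minimal sketch and not the organization of index database 150; the class name ContentIndex, its methods, and the attribute keys are all hypothetical.

```python
from collections import defaultdict


class ContentIndex:
    """Hypothetical in-memory stand-in for a content index (not index database 150)."""

    def __init__(self):
        self._keywords = defaultdict(set)  # key word -> set of document identifiers
        self._attributes = {}              # document identifier -> attribute dict

    def remove(self, doc_id):
        """Drop every key word entry and the attributes for one document."""
        for docs in self._keywords.values():
            docs.discard(doc_id)
        self._attributes.pop(doc_id, None)

    def update(self, doc_id, text, attributes):
        """Re-index one document so the index reflects its current content."""
        self.remove(doc_id)
        for word in set(text.lower().split()):
            self._keywords[word].add(doc_id)
        self._attributes[doc_id] = dict(attributes)

    def search(self, word):
        """Return the identifiers of documents indexed under the key word."""
        return sorted(self._keywords.get(word.lower(), set()))


index = ContentIndex()
index.update(252525, "Acme order", {"acl": "staff"})
index.update(252525, "Foo order", {"acl": "staff"})  # content changed, stale entry removed
```

Re-indexing a document first removes its old entries, so a search for a key word that no longer appears in the document stops matching it.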

Search engine 140 receives search requests from client systems 110A-110C and examines the content index (by retrieving the corresponding data on path 145) to process the search requests. It may be appreciated that the accuracy of the search results depends on the accuracy of the content index in representing the state of various contents of interest.

Crawler 160 is designed to update content index in index database 150. Crawler 160 receives data representing the changes to contents of interest, and updates the content index accordingly. In an embodiment, the data is received by crawling (retrieving and examining) the content in various databases at regular intervals. Such an approach may have various disadvantages, as noted above.

Alternatively, crawler 160 may asynchronously (without crawling) receive update requests from various sources according to various aspects of the present invention, and update the content index accordingly. The update requests can be formed using various protocols (e.g., built on top of socket interfaces provided in TCP/IP environments), as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein.
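As one possible illustration of an update request suitable for a socket interface, the following Python sketch frames each request as a 4-byte length prefix followed by a JSON body. The wire format, the field names (url, doc_id, change_type) and the function names are assumptions made for this example; the description above does not prescribe any particular protocol.

```python
import json
import struct


def encode_update_request(url, doc_id, change_type):
    """Serialize one update request: 4-byte big-endian length, then a JSON body.

    The framing and the field names are hypothetical, chosen only for illustration.
    """
    body = json.dumps(
        {"url": url, "doc_id": doc_id, "change_type": change_type}
    ).encode("utf-8")
    return struct.pack("!I", len(body)) + body


def decode_update_request(frame):
    """Parse a frame produced by encode_update_request back into a dict."""
    (length,) = struct.unpack("!I", frame[:4])
    body = frame[4:4 + length]
    if len(body) != length:
        raise ValueError("truncated update request")
    return json.loads(body.decode("utf-8"))


frame = encode_update_request(
    "http://myhost/orders/order_foo.doc", 252562, "CONTENT_CHANGED"
)
request = decode_update_request(frame)
```

A length prefix lets the receiver separate consecutive requests arriving on one TCP stream; the bytes returned by encode_update_request could be written to a connected socket with sendall.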

Crawler 160, in turn, can be implemented using various approaches well known in the relevant arts. For example, the updates can be received via JMS (Java Message Service), Oracle Advanced Queues (AQ) or simple Database tables. Crawler 160 may then be configured to subscribe to a JMS Queue or AQ, with crawler 160 being notified via Callbacks (or interrupts) when a message is ready for consumption. The Callbacks may alternatively be invoked once per N messages (N>1) to allow bulk processing.
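The once-per-N-messages callback behavior can be sketched as below. This is an in-process Python stand-in written only for illustration; it does not use JMS or Oracle AQ, and the class name BatchingQueue and its methods are hypothetical.

```python
class BatchingQueue:
    """Hypothetical in-process queue that invokes a callback once per N messages."""

    def __init__(self, callback, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._callback = callback
        self._batch_size = batch_size
        self._pending = []

    def publish(self, message):
        """Queue one message; deliver a batch when N messages are pending."""
        self._pending.append(message)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Deliver whatever is pending, e.g. on a timer or at shutdown."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._callback(batch)


received = []                       # batches seen by the (hypothetical) crawler callback
queue = BatchingQueue(received.append, batch_size=2)
for change in ("doc-1", "doc-2", "doc-3"):
    queue.publish(change)
```

After the loop the callback has run once with the first two messages; the third stays pending until another message arrives or flush is called, which is the trade-off bulk processing makes between callback overhead and update latency.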

Various aspects of the present invention enable the content index to be maintained accurately, without consuming substantial processing power, as described below in further detail.

3. Updating Content Index

FIG. 2 is a flowchart illustrating the manner in which content index is maintained to accurately reflect the changes to content according to an aspect of the present invention. The flowchart is described with respect to FIG. 1 for illustration. However, the features can be implemented in various other environments as will be apparent to one skilled in the relevant arts by reading the disclosure provided herein. The flowchart begins in step 201, in which control immediately passes to step 210.

In step 210, the data contents in transit to data repository are examined to determine any changes to stored content (either data itself or meta-data representing the properties of the stored data/file). Thus, with respect to FIG. 1, when data content is being transferred to database server 170 in the form of packets, the content of the packets may be examined to determine any changes to the content stored in database server 170.

In step 260, content index in index database 150 is updated to reflect the determined changes. In general, the update operation can be performed in any way consistent with the implementation of the server controlling/managing the content index. With respect to FIG. 1, a message of suitable format may be sent to crawler 160 to cause crawler 160 to update the content index. The flowchart ends in step 299.

By determining the changes based on content of packets in transit, it should be appreciated that the load on crawler and database servers is reduced. At the same time, the updates are performed timely, as desired. It should be appreciated that the features of FIG. 2 can be implemented in various locations/systems. The description is continued with respect to various example embodiments.

4. Application Server

FIG. 3A is a block diagram illustrating the manner in which changes are determined by monitoring the packets received by application server 130 according to an aspect of the present invention. The block diagram is shown containing filter 320, application 350 and network interface 380. Each block is described below in further detail.

Network interface 380 provides the physical, electrical and protocol interface for application server 130 to interface with network 120. In addition, network interface 380 receives data from application 350 and filter 320 and transmits the received data on network 120 in a suitable packet format. Network interface 380 receives packets on network 120 (for example, from client systems 110A-110C) and forwards the received packet to application 350 in a suitable format.

Application 350 receives packets through network interface 380 and effects a read or write operation to database server 170 in response to appropriate commands encoded in the packet content. Alternatively, application 350 may merely pass through the commands (in the form of SQL statements) directed to database server 170. Application 350 further processes the data received from database server 170 (as a response to the commands), and passes the response to the corresponding client system as appropriate.

Filter 320 (implemented external to application 350, for example, as a lower priority process which does not at least substantially impede the performance of application 350) examines the data contents in transit to database server 170 and determines any changes to the content in database server 170. For example, filter 320 may ‘snoop’ the data sent to/from application 350 to determine the changes being effected to the content of database server 170.

Filter 320 may then cause the change information to be propagated to the content index by creating and sending update requests (by interfacing with network interface 380). For example, filter 320 may send appropriate triggers to cause crawler 160 to update the content index. The triggers may be implemented according to any suitable compatible protocols (between filter 320 and crawler 160).

Alternatively, the change requests may be stored in a storage and crawler 160 may crawl the changes later, as described with respect to FIG. 3B. For conciseness, the blocks operating similar to those in FIG. 3A are shown with same reference numerals/labels and only the differences from that in FIG. 3A are described. Shown there in FIG. 3B is storage 390 in addition to the components described in FIG. 3A.

Filter 320 identifies the changes to content of database server 170 by monitoring the packets from/to application 350. Filter 320 stores the information (representing the content being modified, location, etc.) in storage 390. Crawler 160 then crawls storage 390 periodically, and updates the content index.

In one embodiment, filter 320 merely stores data indicating the addresses (or records or rows of tables in case of relational databases) at which changes have occurred, and crawler 160 may further crawl only the contents at the locations indicated in the storage 390 to obtain modified data. Alternatively, filter 320 may store all information needed by crawler 160 in storage 390 such that crawler 160 merely needs to crawl storage 390 to update content index (index database 150).

For example, the stored information may contain the following records, with each line indicating a change to the corresponding content:

http://myhost/orders/order_acme.xls, 252525, SEC_CHANGED

http://myhost/orders/order_foo.doc, 252562, CONTENT_CHANGED

In the above records, the first field represents the URL, using which the content can be fetched, the second field represents the identifier of the document (which can be used to retrieve just meta-data, as opposed to content), and the third field represents the type of change.
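A crawler reading such records might parse them as in the following Python sketch. The three-field, comma-separated layout is taken from the example records above; the names ChangeRecord, parse_change_record and needs_content_fetch are hypothetical, and treating SEC_CHANGED as a metadata-only change is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    url: str          # where the content can be fetched
    doc_id: int       # identifier usable to retrieve just the meta-data
    change_type: str  # e.g. SEC_CHANGED or CONTENT_CHANGED


def parse_change_record(line):
    """Split one stored line into its three fields.

    Splitting from the right keeps any commas inside the URL intact.
    """
    url, doc_id, change_type = (field.strip() for field in line.rsplit(",", 2))
    return ChangeRecord(url, int(doc_id), change_type)


def needs_content_fetch(record):
    """Assumption: only CONTENT_CHANGED requires fetching the document body."""
    return record.change_type == "CONTENT_CHANGED"


records = [
    parse_change_record("http://myhost/orders/order_acme.xls, 252525, SEC_CHANGED"),
    parse_change_record("http://myhost/orders/order_foo.doc, 252562, CONTENT_CHANGED"),
]
to_fetch = [record.url for record in records if needs_content_fetch(record)]
```

Under that assumption the crawler would fetch only order_foo.doc by URL and would refresh the first document's meta-data using its identifier.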

The description is continued with respect to the manner in which packets may be examined/snooped to identify the change of contents of database server 170.

5. Identifying Changes to Content

It should be appreciated that the determination of changes generally depends on the specific packet formats and the protocols employed by the applications. The manner in which changes to content may be identified is illustrated with respect to some example scenarios below.

In one embodiment, it is assumed that a Web application provides access to the content over the internet using HTTP and the related packets may be examined to determine that the contents of database server 170 are going to be changed. For example, (with reference to FIG. 4) an IP packet in such a scenario would have source address 420 equaling the IP address of the client system 110A from which the change request has originated, destination address 425 equaling the IP address of application server 130, destination port 455 equaling a value of 80 (the well-known HTTP port) to represent an HTTP request, and payload 470 indicating the change sought. For example, an HTTP PUT request (defined in detail in RFC 2616) may be encoded in payload 470 to specify a change.
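The packet test described above can be sketched in Python as follows. The sketch assumes the packet has already been parsed into the fields of FIG. 4 and that HTTP is carried on the standard port 80 (decimal); the Packet class, the addresses used, and the choice of PUT and POST as change-indicating methods are assumptions made for illustration.

```python
from dataclasses import dataclass

HTTP_PORT = 80                      # standard HTTP port, decimal
CHANGE_METHODS = (b"PUT", b"POST")  # assumption: methods treated as change requests


@dataclass(frozen=True)
class Packet:
    source_address: str       # corresponds to field 420
    destination_address: str  # corresponds to field 425
    destination_port: int     # corresponds to field 455
    payload: bytes            # corresponds to field 470


def indicates_change(packet, server_address):
    """Return True if the packet looks like an HTTP change request to the server."""
    if packet.destination_address != server_address:
        return False
    if packet.destination_port != HTTP_PORT:
        return False
    # The HTTP method is the first space-delimited token of the request line.
    method = packet.payload.split(b" ", 1)[0]
    return method in CHANGE_METHODS


put_packet = Packet("10.0.0.5", "10.0.0.9", 80, b"PUT /orders/1 HTTP/1.1\r\n")
get_packet = Packet("10.0.0.5", "10.0.0.9", 80, b"GET /orders/1 HTTP/1.1\r\n")
```

Only the PUT packet is reported as a change; the GET packet is a read and is ignored, as is any packet addressed to a different host or port.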

In one implementation, filter 320 is implemented as a Servlet filter, which examines the request parameters and the URL and decides whether a change in the content is being requested. Such a decision may be based upon a simple URL pattern (however, the approach can be extended to use a complex matching algorithm). For example, the below parameters indicate that an order form (a document) is being updated:

POST http://www.myserver.com/ordersys/updateOrderForm.jsp

host:myserver.com

content_type:14

id=232452&qty=25&price=90&loc=US.
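The URL-pattern decision applied to the above request might be sketched as follows. The sketch is written in Python rather than as a Java Servlet filter, purely for illustration; the regular expression, the function name detect_change, and the shape of the returned change event are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical pattern: any update*.jsp page under /ordersys/ modifies a document.
UPDATE_PATTERN = re.compile(r"/ordersys/update\w+\.jsp$")


def detect_change(method, url, body):
    """Return a change event for a request that updates content, else None."""
    if method.upper() != "POST":
        return None
    if not UPDATE_PATTERN.search(urlsplit(url).path):
        return None
    ids = parse_qs(body).get("id")  # request parameters from the form body
    if not ids:
        return None
    return {"url": url, "doc_id": ids[0], "change_type": "CONTENT_CHANGED"}


event = detect_change(
    "POST",
    "http://www.myserver.com/ordersys/updateOrderForm.jsp",
    "id=232452&qty=25&price=90&loc=US",
)
ignored = detect_change("GET", "http://www.myserver.com/ordersys/viewOrder.jsp", "")
```

The first call yields an event carrying document identifier 232452, which could then be sent to the crawler or written to storage; the second request does not match and produces no event.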

It should be appreciated that changes to content can be identified in various other ways depending on the specific environment. For example, when a client system (e.g., 110A) interacts with an FTP server (not shown, but connected to network 120 as a server), a change is generally initiated by a well known VERB (Command) such as PUT or DELETE. The packet payload can be examined (e.g., in Netmon 510 described below) to determine the presence of these commands. Alternatively, an FTP proxy (similar to filter 320) may trigger a content change event and push it to the crawler.
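A check for change-initiating FTP commands might look like the following Python sketch. It inspects one line of the FTP control connection; note that RFC 959 defines the on-the-wire commands STOR, APPE and DELE, which correspond to the client-level put, append and delete operations. The function name and the change type labels (in particular DELETED) are hypothetical.

```python
# Control-connection commands (RFC 959) that modify stored content, mapped to
# illustrative change types. The DELETED label is an assumption of this sketch.
CHANGE_VERBS = {
    "STOR": "CONTENT_CHANGED",
    "APPE": "CONTENT_CHANGED",
    "DELE": "DELETED",
}


def ftp_change_event(control_line):
    """Return (pathname, change_type) if the command modifies content, else None."""
    parts = control_line.strip().split(" ", 1)
    verb = parts[0].upper()
    if verb not in CHANGE_VERBS or len(parts) < 2:
        return None
    return (parts[1], CHANGE_VERBS[verb])


upload = ftp_change_event("STOR order_foo.doc\r\n")
download = ftp_change_event("RETR order_foo.doc\r\n")
```

An upload produces a change event for order_foo.doc, while a download (RETR) is a read and produces none.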

While the above implementations are described as being performed within application/FTP servers, it should be appreciated that alternative embodiments can be implemented in other blocks, without departing from the scope and spirit of various aspects of the present invention, as described below in further detail.

6. Network Monitor

According to an aspect of the present invention, a network monitor unit may be implemented to monitor the packets destined to database server 170, as illustrated with respect to FIG. 5. As shown there, Netmon 510 may be implemented to monitor the packets on path 137 (e.g., implemented as an Intra-net LAN).

Netmon 510 may contain a network interface similar to network interface 380, but forwarding all packets to a filter (implemented using the same principles as filter 320 in examining packet content and determining the occurrence of changes). The filter may then operate as described above with respect to FIG. 3A or 3B. In general, netmon 510 and web/application server 130 operate as monitoring systems which detect changes to desired content (in this case in database server 170), and cause the corresponding information to be propagated to the content index.

While the above implementations are described as being external to database server 170, more alternative embodiments may be implemented within database server 170 without departing from the scope and spirit of various aspects of the present invention as described below in further detail.

7. Database Triggers

According to an aspect of the present invention, database triggers are programmed within database server 170 to send change related information to crawler 160. As is well known in the relevant arts (and described, for example, in Chapter 10 of a book entitled, “Oracle Database 10g PL/SQL Programming”, By: Scott Urman et al, ISBN: 0072230665), database triggers are programs that implicitly start when an INSERT, UPDATE, or DELETE statement for a base table is executed. A trigger can contain several SQL statements. A series of control structures are also generally available for the application programmer. One can program loops or branches, for example, within a trigger.

Thus, when a database centric application (e.g., application 350) triggers a content update, a database trigger may be invoked post (after) update of a specific column in the Database. This column can correspond either to the content or to the metadata associated with the content. The trigger may be designed to push the content change event to crawler 160, as described above. The implementation of the triggers consistent with the interface requirements of crawler 160 will be apparent to one skilled in the relevant arts by reading the disclosure provided herein.
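A column-specific after-update trigger can be illustrated with the following Python sketch, which uses SQLite (through the standard sqlite3 module) as a stand-in for database server 170. It is not Oracle PL/SQL, and because an SQLite trigger cannot call out to an external crawler, the trigger records each change in a table that a crawler could read later. The table names, column names and trigger names are hypothetical.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory database whose triggers log column-specific updates."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT, acl TEXT);
        CREATE TABLE change_log (doc_id INTEGER, change_type TEXT);

        -- Fires after an UPDATE that sets the content column.
        CREATE TRIGGER content_changed AFTER UPDATE OF content ON documents
        BEGIN
            INSERT INTO change_log VALUES (NEW.id, 'CONTENT_CHANGED');
        END;

        -- Fires after an UPDATE that sets the metadata (access control) column.
        CREATE TRIGGER security_changed AFTER UPDATE OF acl ON documents
        BEGIN
            INSERT INTO change_log VALUES (NEW.id, 'SEC_CHANGED');
        END;
        """
    )
    return conn


conn = build_demo_database()
conn.execute("INSERT INTO documents VALUES (252562, 'draft', 'staff')")
conn.execute("UPDATE documents SET content = 'final' WHERE id = 252562")
conn.execute("UPDATE documents SET acl = 'managers' WHERE id = 252562")
pending = conn.execute(
    "SELECT doc_id, change_type FROM change_log ORDER BY rowid"
).fetchall()
```

The application issuing the UPDATE statements contains no change-tracking logic; the two logged rows are produced entirely by the triggers, one for the content column and one for the metadata column.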

The features described above can be implemented using a combination of hardware, software and firmware as suitable for specific situations. The description is continued with respect to an example embodiment in which various features are operative by execution of appropriate software instructions.

8. Digital Processing System

FIG. 6 is a block diagram illustrating the details of digital processing system 600 in which various aspects of the present invention are operative by execution of appropriate software instructions. System 600 may correspond to each of application server 130, netmon 510, FTP server, etc., described above. System 600 may contain one or more processors such as central processing unit (CPU) 610, random access memory (RAM) 620, secondary memory 630, graphics controller 660, display unit 670, network interface 680, and input interface 690. All the components except display unit 670 may communicate with each other over communication path 650, which may contain several buses as is well known in the relevant arts. The components of FIG. 6 are described below in further detail.

CPU 610 may execute instructions stored in RAM 620 to provide several features of the present invention. For example, tasks such as monitoring of packets for changes, sending appropriate commands to web crawlers, etc., may be performed in response to execution of the instructions. CPU 610 may contain multiple processing units, with each processing unit potentially being designed for a specific task. Alternatively, CPU 610 may contain only a single general purpose processing unit. RAM 620 may receive instructions from secondary memory 630 using communication path 650.

Graphics controller 660 generates display signals (e.g., in RGB format) to display unit 670 based on data/instructions received from CPU 610. Display unit 670 contains a display screen to display the images defined by the display signals. Input interface 690 may correspond to a keyboard and/or mouse (not shown in FIG. 6). Network interface 680 provides connectivity to a network (e.g., using Internet Protocol), and may be used to communicate with the other systems of FIG. 1.

Secondary memory 630 may contain hard drive 635, flash memory 636 and removable storage drive 637. Secondary memory 630 may store the data and software instructions (e.g., methods instantiated by each client system), which enable system 600 to provide several features in accordance with the present invention. Some or all of the data and instructions may be provided on removable storage unit 640, and the data and instructions may be read and provided by removable storage drive 637 to CPU 610. Floppy drive, magnetic tape drive, CD-ROM drive, DVD Drive, Flash memory, removable memory chip (PCMCIA Card, EPROM) are examples of such removable storage drive 637.

Removable storage unit 640 may be implemented using medium and storage format compatible with removable storage drive 637 such that removable storage drive 637 can read the data and instructions. Thus, removable storage unit 640 includes a computer readable storage medium having stored therein computer software and/or data.

In this document, the term “computer program product” is used to generally refer to removable storage unit 640 or hard disk installed in hard drive 635. These computer program products are means for providing software to system 600. CPU 610 may retrieve the software instructions, and execute the instructions to provide various features of the present invention described above.

From the above, it may be appreciated that various embodiments provide the ability to detect change asynchronously without the application (which would cause changes to the data to be searched) having to detect and/or propagate such changes. As a result, the overhead on the applications is reduced. In addition, the overhead on crawler type devices is also reduced since the notifications of updates are received asynchronously.

9. Conclusion

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents. Also, the various aspects, features, components and/or embodiments of the present invention described above may be embodied singly or in any combination in a data storage system such as a database system.

What is claimed is:
 1. A system comprising: a data repository containing a plurality of stored entities; an index database storing indices facilitating identification of suitable ones of said plurality of stored entities matching a desired search criteria; a search engine receiving a query with said desired search criteria and determining a search result by examining said index database; and an application causing a change to at least one of said plurality of stored entities, wherein changes to said plurality of stored entities are detected external to said application and data is sent for propagation of information on said changes to said index database.
 2. The system of claim 1, further comprising a monitoring system is designed to examine packets in transit to said data repository and generating a packet representing a change to said plurality of stored entities, wherein said packet is used to propagate information on said change to said index database.
 3. The system of claim 2, further comprising a crawler which is designed to update said index database, wherein said monitoring system sends said packet to cause information on said change to be propagated to said index database.
 4. The system of claim 3, wherein said monitoring system comprises an application server executing said application and said data repository comprises a database server, said application server being positioned in a path from a client system to said database server, wherein said application server receives packets from said client system designed to trigger said change, wherein said application server comprises a filter block which detects said change and generates said packet, wherein said filter block is implemented external to said application.
 5. The system of claim 4, wherein said filter block sends said packet with a destination address of said crawler to cause said crawler to propagate said change to said index database.
 6. The system of claim 4, wherein said filter block stores data indicating said change in a storage and said crawler crawls said storage periodically to propagate said change to said index database.
 7. The system of claim 2, wherein said monitoring system is connected to a communication path carrying said packets in transit to said data repository, said monitoring system passively monitoring said communication path to detect said change.
 8. The system of claim 1, wherein said data repository comprises a file transfer protocol (FTP) server, and said system further comprises a filter block which examines VERBs directed to said FTP server to detect a change to said plurality of stored entities.
 9. The system of claim 1, wherein said data repository comprises a database server and a database trigger is provided to detect a change to said plurality of stored entities and send for propagation to said index database.
 10. The system of claim 1, wherein one of said changes represents at least one of deletion of one of said plurality of stored entities, addition of a new entity to said plurality of stored entities, or a change of an attribute of one of said plurality of stored entities.
 11. A method of updating an index information in an index database with data indicating a change in a data content stored by a data repository, wherein an application performs said change, said method comprising: detecting said change to said data content, wherein said detecting is implemented external to said application; and creating an update request upon said detecting, wherein said update request causes said index information to be updated with information indicating said change to said data content.
 12. The method of claim 11, wherein said detecting comprises examining packets in transit to said data repository, said method further comprising generating a packet representing a change to said plurality of stored entities, wherein said packet is used to propagate information on said change to said index database.
 13. The method of claim 12, wherein a crawler which is designed to update said index database and wherein a monitoring system performs said generating, said method further comprises sending said packet from said monitoring system to said crawler to cause information on said change to be propagated to said index database.
 14. The method of claim 12, further comprises storing data indicating said change in a storage and a crawler crawls said storage periodically to propagate information representing said change to said index database.
 15. The method of claim 12, wherein detecting comprises passively monitoring a communication path from a client system to said data depository to detect said change, wherein said client system causes said application to perform said change.
 16. The method of claim 11, wherein said data repository comprises a file transfer protocol (FTP) server, wherein said detecting comprises examining VERBs directed to said FTP server to detect said change.
 17. A computer readable medium carrying one or more sequences of instructions causing a system to update an index information in an index database with data indicating a change in a data content stored by a data repository, wherein an application performs said change, wherein execution of said one or more sequences of instructions by one or more processors contained in said system causes said one or more processors to perform the actions of: detecting said change to said data content, wherein said detecting is implemented external to said application; and creating an update request upon said detecting, wherein said update request causes said index information to be updated with information indicating said change to said data content.
 18. The computer readable medium of claim 17, wherein said detecting comprises examining packets in transit to said data repository, the actions further comprising generating a packet representing a change to said plurality of stored entities, wherein said packet is used to propagate information on said change to said index database.
 19. The computer readable medium of claim 18, wherein a crawler which is designed to update said index database and wherein said system comprises a monitoring system, said actions further comprises sending said packet from said monitoring system to said crawler to cause information on said change to be propagated to said index database.
 20. The computer readable medium of claim 18, further comprises storing said information in a storage and a crawler crawls said storage periodically to propagate said change to said index database.
 21. The computer readable medium of claim 17, wherein detecting comprises passively monitoring a communication path from a client system to said data depository to detect said change, wherein said client system causes said application to perform said change.